Goto

Collaborating Authors

 return function



Test-driven Reinforcement Learning in Continuous Control

Yu, Zhao, Wu, Xiuping, Ke, Liangjun

arXiv.org Artificial Intelligence

Reinforcement learning (RL) has been recognized as a powerful tool for robot control tasks. RL typically employs reward functions to define task objectives and guide agent learning. However, since the reward function serves the dual purpose of defining the optimal goal and guiding learning, it is challenging to design the reward function manually, which often results in a suboptimal task representation. To tackle the reward design challenge in RL, inspired by the satisficing theory, we propose a Test-driven Reinforcement Learning (TdRL) framework. In the TdRL framework, multiple test functions are used to represent the task objective rather than a single reward function. Test functions can be categorized as pass-fail tests and indicative tests, each dedicated to defining the optimal objective and guiding the learning process, respectively, thereby making defining tasks easier. Building upon such a task definition, we first prove that if a trajectory return function assigns higher returns to trajectories closer to the optimal trajectory set, maximum entropy policy optimization based on this return function will yield a policy that is closer to the optimal policy set. Then, we introduce a lexicographic heuristic approach to compare the relative distance relationship between trajectories and the optimal trajectory set for learning the trajectory return function. Furthermore, we develop an algorithm implementation of TdRL. Experimental results on the DeepMind Control Suite benchmark demonstrate that TdRL matches or outperforms handcrafted reward methods in policy training, with greater design simplicity and inherent support for multi-objective optimization. We argue that TdRL offers a novel perspective for representing task objectives, which could be helpful in addressing the reward design challenges in RL applications.



Modeling Human Beliefs about AI Behavior for Scalable Oversight

Lang, Leon, Forré, Patrick

arXiv.org Artificial Intelligence

Contemporary work in AI alignment often relies on human feed back to teach AI systems human preferences and values. Yet as AI systems grow more cap able, human feedback becomes increasingly unreliable. This raises the problem o f scalable oversight: How can we supervise AI systems that exceed human capabilities? In t his work, we propose to model the human evaluator's beliefs about the AI system's be havior to better interpret the human's feedback. We formalize human belief models and theo retically analyze their role in inferring human values. We then characterize the remaining ambiguity in this inference and conditions for which the ambiguity disappears. To mitigate reliance on exact belief models, we then introduce the relaxation of human belief model cover ing. Finally, we propose using foundation models to construct covering belief models, pro viding a new potential approach to scalable oversight.


When Your AIs Deceive You: Challenges with Partial Observability of Human Evaluators in Reward Learning

Lang, Leon, Foote, Davis, Russell, Stuart, Dragan, Anca, Jenner, Erik, Emmons, Scott

arXiv.org Machine Learning

Past analyses of reinforcement learning from human feedback (RLHF) assume that the human fully observes the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deception and overjustification. Modeling the human as Boltzmann-rational w.r.t. a belief over trajectories, we prove conditions under which RLHF is guaranteed to result in policies that deceptively inflate their performance, overjustify their behavior to make an impression, or both. To help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. In some cases, accounting for partial observability makes it theoretically possible to recover the return function and thus the optimal policy, while in other cases, there is irreducible ambiguity. We caution against blindly applying RLHF in partially observable settings and propose research directions to help tackle these challenges.


Beyond Exponentially Discounted Sum: Automatic Learning of Return Function

Wang, Yufei, Ye, Qiwei, Liu, Tie-Yan

arXiv.org Machine Learning

In reinforcement learning, Return, which is the weighted accumulated future rewards, and Value, which is the expected return, serve as the objective that guides the learning of the policy. In classic RL, return is defined as the exponentially discounted sum of future rewards. One key insight is that there could be many feasible ways to define the form of the return function (and thus the value), from which the same optimal policy can be derived, yet these different forms might render dramatically different speeds of learning this policy. In this paper, we research how to modify the form of the return function to enhance the learning towards the optimal policy. We propose to use a general mathematical form for return function, and employ meta-learning to learn the optimal return function in an end-to-end manner. We test our methods on a specially designed maze environment and several Atari games, and our experimental results clearly indicate the advantages of automatically learning optimal return functions in reinforcement learning.


Hierarchical Policy Search via Return-Weighted Density Estimation

Osa, Takayuki (University of Tokyo / RIKEN) | Sugiyama, Masashi (RIKEN / University of Tokyo)

AAAI Conferences

Learning an optimal policy from a multi-modal reward function is a challenging problem in reinforcement learning (RL). Hierarchical RL (HRL) tackles this problem by learning a hierarchicalpolicy, where multiple option policies are in charge of different strategies corresponding to modes of a reward function and a gating policy selects the best option for a given context. Although HRL has been demonstrated to be promising, current state-of-the-art methods cannot still perform well in complex real-world problems due to the difficulty of identifying modes of the reward function. In this paper, we propose a novel method called hierarchical policy search via return-weighted density estimation (HPSDE), which can efficiently identify the modes through density estimation with return-weighted importance sampling. Our proposed method finds option policies corresponding to the modes of the return function and automatically determines the number and the location of option policies, which significantly reduces the burden of hyper-parameters tuning. Through experiments, we demonstrate that the proposed HPSDE successfully learns option policies corresponding to modes of the return function and that it can be successfully applied to a motion planning problem of a redundant robotic manipulator.


Hierarchical Policy Search via Return-Weighted Density Estimation

Osa, Takayuki, Sugiyama, Masashi

arXiv.org Machine Learning

Learning an optimal policy from a multi-modal reward function is a challenging problem in reinforcement learning (RL). Hierarchical RL (HRL) tackles this problem by learning a hierarchical policy, where multiple option policies are in charge of different strategies corresponding to modes of a reward function and a gating policy selects the best option for a given context. Although HRL has been demonstrated to be promising, current state-of-the-art methods cannot still perform well in complex real-world problems due to the difficulty of identifying modes of the reward function. In this paper, we propose a novel method called hierarchical policy search via return-weighted density estimation (HPSDE), which can efficiently identify the modes through density estimation with return-weighted importance sampling. Our proposed method finds option policies corresponding to the modes of the return function and automatically determines the number and the location of option policies, which significantly reduces the burden of hyper-parameters tuning. Through experiments, we demonstrate that the proposed HPSDE successfully learns option policies corresponding to modes of the return function and that it can be successfully applied to a challenging motion planning problem of a redundant robotic manipulator.


Optimal Rewards versus Leaf-Evaluation Heuristics in Planning Agents

Sorg, Jonathan (University of Michigan) | Singh, Satinder (University of Michigan) | Lewis, Richard L. (University of Michigan)

AAAI Conferences

Planning agents often lack the computational resources needed to build full planning trees for their environments. Agent designers commonly overcome this finite-horizon approximation by applying an evaluation function at the leaf-states of the planning tree. Recent work has proposed an alternative approach for overcoming computational constraints on agent design: modify the reward function. In this work, we compare this reward design approach to the common leaf-evaluation heuristic approach for improving planning agents. We show that in many agents, the reward design approach strictly subsumes the leaf-evaluation approach, i.e., there exists a reward function for every leaf-evaluation heuristic that leads to equivalent behavior, but the converse is not true. We demonstrate that this generality leads to improved performance when an agent makes approximations in addition to the finite-horizon approximation. As part of our contribution, we extend PGRD, an online reward design algorithm, to develop reward design algorithms for Sparse Sampling and UCT, two algorithms capable of planning in large state spaces.


On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient

Jie, Tang, Abbeel, Pieter

Neural Information Processing Systems

Likelihood ratio policy gradient methods have been some of the most successful reinforcement learning algorithms, especially for learning on physical systems. We describe how the likelihood ratio policy gradient can be derived from an importance sampling perspective. This derivation highlights how likelihood ratio methods under-use past experience by (a) using the past experience to estimate {\em only} the gradient of the expected return $U(\theta)$ at the current policy parameterization $\theta$, rather than to obtain a more complete estimate of $U(\theta)$, and (b) using past experience under the current policy {\em only} rather than using all past experience to improve the estimates. We present a new policy search method, which leverages both of these observations as well as generalized baselines---a new technique which generalizes commonly used baseline techniques for policy gradient methods. Our algorithm outperforms standard likelihood ratio policy gradient algorithms on several testbeds.